{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2678a633-ed43-4556-aeea-84113aeffd22",
   "metadata": {},
   "source": [
    "# Introduction\n",
    "In clinical care settings, ECG interpretations are often recorded as free-text. It can be challenging to translate these into binary labels for training and evaluation due to synonyms, acronyms, grammar, typographical errors, evolving medical terminology, and implied findings.\n",
    "\n",
    "Specifically, we:\n",
    "- Apply pattern matching (patterns were hand-curated), maintaining positional information\n",
    "- Derive a series of entities (e.g., 'tachycardia', 'infarction'), descriptors (e.g., 'probably', 'moderate', 'acute'), and connectives (e.g., 'associated with', 'transitions to')\n",
    "- Distill elevant information from the descriptors and connectives down into their corresponding entities\n",
    "- Apply a knowledge graph encoding label relationships to recursively mark labels as true, e.g., labeling *Ventricular tachycardia* when *Torsades de Pointes* was stated.\n",
    "- Map the resulting entities into labels which can be flexibly manipulated.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "db10252b-2d62-403e-b143-88d84ed1e575",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "\n",
    "root = os.path.dirname(os.getcwd())\n",
    "labeler_dir = os.path.join(root, 'data/mimic_iv_ecg/labeler')\n",
    "\n",
    "labeler_results_file = os.path.join(labeler_dir, 'labeler_res.pkl')\n",
    "if os.path.exists(labeler_results_file):\n",
    "    os.remove(labeler_results_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64ac0597-2ff5-4801-91da-7c6cea597b83",
   "metadata": {},
   "source": [
    "# Load labeler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13f66320-8743-4b13-8e41-571909c1f0c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install tqdm\n",
    "!pip install networkx"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "500dc4d3-c78a-4e82-961c-8dbec5dd18e8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.insert(0, os.path.join(root, 'labeler/'))\n",
    "\n",
    "from pattern_labeler import PatternLabelerConfig, PatternLabeler\n",
    "from preprocess import preprocess_texts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cd442a26-7495-4d95-b34f-736a1c6e227f",
   "metadata": {},
   "outputs": [],
   "source": [
    "labeler_config = PatternLabelerConfig.from_json(labeler_dir)\n",
    "labeler_config.entity_templates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5c8488f-4b0f-4ea2-96c0-689d7bcad4aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "labeler = PatternLabeler(labeler_config)\n",
    "labeler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9607311a-3c08-46de-a62b-8c8948c510d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "labeler.plot_ancestor_subgraph(\n",
    "    \"Tachycardia\",\n",
    "    figsize=(15, 8),\n",
    "    node_size=100,\n",
    "    font_size=8,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41cbcc1a-6f39-4b56-8580-6e592ec51408",
   "metadata": {},
   "source": [
    "# Preprocess"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3adf7521-0698-40e7-a7de-3baeececec11",
   "metadata": {},
   "outputs": [],
   "source": [
    "interpretations = pd.Series([\n",
    "    \"Sinus rhythm; Possible right atrial abnormality\",\n",
    "    \"Sinus tach; Normal electrocardiogram except for rate\",\n",
    "    \"Normal sinus rhtyhm; Normal ECG; missing lead v2\",\n",
    "    \"Accelerated idioventricular rhythm; LAD; Borderline ECG\",\n",
    "    \"Stach with PVC(s); Possible seotal infarct; Undefined\",\n",
    "])\n",
    "texts = preprocess_texts(interpretations.copy())\n",
    "texts.rename(\"text\", inplace=True)\n",
    "texts"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6288620a-1c4d-4736-b38b-dec700e0fed8",
   "metadata": {},
   "source": [
    "# Parse"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "de031bff-d2ef-4d00-b655-f4c505f472c2",
   "metadata": {},
   "outputs": [],
   "source": [
    "labeler_res = labeler(\n",
    "    texts=texts.copy(),\n",
    "    restore_path=labeler_results_file,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa537950-69ba-497f-bdbd-e4890cd77e69",
   "metadata": {},
   "source": [
    "# Analyze results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d94c6d5f-73c4-464d-833c-bf1f055b2254",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pickle\n",
    "\n",
    "with open(labeler_results_file, \"rb\") as f:\n",
    "    labeler_res = pickle.load(f)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe0304fb-0898-446d-9c24-86a669baf37e",
   "metadata": {},
   "source": [
    "## View unmatched text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddad2c24-e4c6-433e-a091-cb3873cd8240",
   "metadata": {},
   "outputs": [],
   "source": [
    "unmatched = labeler_res.text_results['unmatched'][\n",
    "    labeler_res.text_results['unmatched'] != ''\n",
    "].copy()\n",
    "unmatched = unmatched.str.replace(\"[^\\w\\s]\", \"\", regex=True).str.strip()\n",
    "unmatched = unmatched[unmatched != ''].copy()\n",
    "unmatched"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b034629d-00f4-4478-8822-c485d82342df",
   "metadata": {},
   "source": [
    "# Create labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8806b1ef-83d6-49ac-a3c0-25b1aa1ccb24",
   "metadata": {},
   "outputs": [],
   "source": [
    "labels_flat = labeler_res.labels_flat.copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8bd0024e-038e-4cb1-8284-d3cad59f8f8e",
   "metadata": {},
   "outputs": [],
   "source": [
    "vcs = labels_flat[\n",
    "    ~labels_flat['name'].str.contains(\" - \", regex=False)\n",
    "]['name'].value_counts()\n",
    "vcs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd46f9cb-7a51-470e-a551-5c3ae048eb8c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Removed from UHN labels:\n",
    "# Normal sinus rhythm\n",
    "# 2nd degree atrioventricular block\n",
    "# Ventricular pacing\n",
    "# Atrial pacing\n",
    "\n",
    "CONFIRM_LABELS = \"\"\"\n",
    "Poor data quality\n",
    "Sinus rhythm\n",
    "Sinus tachycardia\n",
    "Premature ventricular contraction\n",
    "Tachycardia\n",
    "Right atrial abnormality\n",
    "\"\"\".split(\"\\n\")\n",
    "CONFIRM_LABELS = [label for label in CONFIRM_LABELS if label != \"\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27249dbe-6c57-4bb4-8421-849368c8bc6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "vcs.loc[CONFIRM_LABELS] / len(texts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "749ce30f-4b04-4558-9c27-ee37cfecc9cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "labels_flat_final = labels_flat[labels_flat[\"name\"].isin(CONFIRM_LABELS)]\n",
    "labels = pd.get_dummies(labels_flat_final['name'])[CONFIRM_LABELS]\n",
    "labels.index.name = 'idx'\n",
    "labels = labels.groupby('idx').any()\n",
    "\n",
    "# Add in rows which had no labels\n",
    "no_label_rows = pd.DataFrame(index=texts.index[~texts.index.isin(labels.index)].copy(), columns=CONFIRM_LABELS)\n",
    "no_label_rows.loc[:, :] = False\n",
    "labels = pd.concat([labels, no_label_rows]).sort_index()\n",
    "labels.index.name = 'idx'\n",
    "labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "af02506d-1079-40a8-b5a4-db63a46ebcea",
   "metadata": {},
   "outputs": [],
   "source": [
    "assert len(labels) == len(texts)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68aab80e-7272-4229-8fd7-309e2793f664",
   "metadata": {},
   "source": [
    "# Labeler definition example"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90147f77-954f-41db-ad9a-ad5aaa4e7468",
   "metadata": {},
   "source": [
    "If you're looking to define your own labeler, it can be easier to start from Python code, rather than writing the JSON. It can then be converted to JSON for easier distribution and versioning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc8b2df0-17de-44d4-b79b-84338460b8c3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Dict, List, Optional, Union\n",
    "\n",
    "from pattern_labeler import (\n",
    "    AttachedDescriptorTem,\n",
    "    CompoundTem,\n",
    "    Connective,\n",
    "    DescriptorTem,\n",
    "    Entity,\n",
    "    EntityPattern,\n",
    "    EntityTem,\n",
    "    SplitDescriptorTem,\n",
    "    TravelingDescriptorsTem,\n",
    "    DescriptorPattern\n",
    ")\n",
    "\n",
    "ENTITY_TEMPLATES: List[EntityTem] = [\n",
    "    EntityTem(\"Sinus rhythm\"),\n",
    "    EntityTem(\"Arrhythmia\"),\n",
    "    EntityTem(\"Tachycardia\", sup=\"Arrhythmia\"),\n",
    "    EntityTem(\"Sinus tachycardia\", sup=[\"Sinus rhythm\", \"Tachycardia\"]),\n",
    "    EntityTem(\"Ectopic beat\"),\n",
    "    EntityTem(\"Ectopic ventricular contraction\", sup=\"Ectopic beat\"),\n",
    "    EntityTem(\"Bifascicular block\"),\n",
    "    EntityTem(\"Right bundle branch block\"),\n",
    "    EntityTem(\"Fascicular block\"),\n",
    "]\n",
    "\n",
    "ENTITY_PATTERNS: List[EntityPattern] = [\n",
    "    EntityPattern(\"tachycardia\", \"Tachycardia\"),\n",
    "    EntityPattern(\"ectopic beat\", \"Ectopic beat\"),\n",
    "]\n",
    "\n",
    "DESCRIPTOR_TEMPLATES: List[DescriptorTem] = [\n",
    "    DescriptorTem(\"Severe\", category=\"severity\"),\n",
    "    DescriptorTem(\"Sinus\", category=\"location\"),\n",
    "    DescriptorTem(\"Atrial\", category=\"location\"),\n",
    "    DescriptorTem(\"Ventricular\", category=\"location\"),\n",
    "    DescriptorTem(\"Multiple\", category=\"quantity\"),\n",
    "    DescriptorTem(\"Possible\", category=\"uncertainty\"),\n",
    "    DescriptorTem(\"Probable\", category=\"uncertainty\"),\n",
    "]\n",
    "\n",
    "DESCRIPTOR_PATTERNS: List[DescriptorPattern] = [\n",
    "    DescriptorPattern(\"severe\", \"Severe\"),\n",
    "    DescriptorPattern(\"atrial\", \"Atrial\"),\n",
    "    DescriptorPattern(\"possible\", \"Possible\"),\n",
    "    DescriptorPattern(\"probably\", \"Probable\"),\n",
    "    DescriptorPattern(\"multiple\", \"Multiple\"),\n",
    "]\n",
    "\n",
    "# === Connectives ===\n",
    "CONNECTIVES: List[Connective] = [\n",
    "    Connective(\"and\"),\n",
    "    Connective(\"suggests\", descriptors=[None, \"Probable\"], tags=\"causal\"),\n",
    "]\n",
    "\n",
    "SPLIT_DESCRIPTOR_TEMPLATES: List[SplitDescriptorTem] = [\n",
    "    SplitDescriptorTem(\n",
    "        \"Atrioventricular\",\n",
    "        split=[\"Atrial\", \"Ventricular\"],\n",
    "        patterns=\"atrioventricular\",\n",
    "    ),\n",
    "]\n",
    "\n",
    "COMPOUND_TEMPLATES: List[CompoundTem] = [\n",
    "    CompoundTem(\n",
    "        \"Bifascicular block\",\n",
    "        [\"Right bundle branch block\", \"Fascicular block\"],\n",
    "    ),\n",
    "]\n",
    "\n",
    "ATTACHED_DESCRIPTOR_TEMPLATES: List[AttachedDescriptorTem] = [\n",
    "    AttachedDescriptorTem(\"Atrial tachycardia\", \"Tachycardia\", \"Atrial\"),\n",
    "]\n",
    "\n",
    "TRAVELING_DESCRIPTOR_TEMPLATES: List[TravelingDescriptorsTem] = [\n",
    "    TravelingDescriptorsTem(\"Ectopic Beat\", [\"Multiple\"]),\n",
    "]\n",
    "\n",
    "UNCERTAINTY_MAP: Dict[str, float] = {\n",
    "    \"Possible\": 0.5,\n",
    "    \"Probable\": 0.7,\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "662f8f1c-cb70-49de-a6f9-14f69b191119",
   "metadata": {},
   "outputs": [],
   "source": [
    "labeler_config = PatternLabelerConfig(\n",
    "    ENTITY_TEMPLATES,\n",
    "    ENTITY_PATTERNS,\n",
    "    descriptor_templates=DESCRIPTOR_TEMPLATES,\n",
    "    descriptor_patterns=DESCRIPTOR_PATTERNS,\n",
    "    split_descriptor_templates=SPLIT_DESCRIPTOR_TEMPLATES,\n",
    "    connectives=CONNECTIVES,\n",
    "    compound_templates=COMPOUND_TEMPLATES,\n",
    "    attached_descriptor_templates=ATTACHED_DESCRIPTOR_TEMPLATES,\n",
    "    traveling_descriptor_templates=TRAVELING_DESCRIPTOR_TEMPLATES,\n",
    "    uncertainty_map=UNCERTAINTY_MAP,\n",
    ")\n",
    "labeler = PatternLabeler(labeler_config)\n",
    "labeler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32e40bd8-7832-4af6-bc70-5b8d77d17efe",
   "metadata": {},
   "outputs": [],
   "source": [
    "custom_labeler_dir = os.path.join(root, 'data/custom_labeler')\n",
    "labeler_config.to_json(custom_labeler_dir)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df2adca5-06ad-4e0f-acfc-16504bdf3f2d",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "fairseq",
   "language": "python",
   "name": "fairseq"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
